One of the main challenges in the field of embodied artificial intelligence is the open-ended 
autonomous learning of complex behaviors. Our approach is to use task-independent, 
information-driven intrinsic motivation(s) to support task-dependent learning. The work 
presented here is a preliminary step in which we investigate the predictive information 
(the mutual information of the past and future of the sensor stream) as an intrinsic drive, 
ideally supporting any kind of task acquisition. Previous experiments have shown that 
the predictive information (PI) is a good candidate to support autonomous, open-ended 
learning of complex behaviors, because a maximization of the PI corresponds to an 
exploration of morphology- and environment-dependent behavioral regularities. The idea 
is that these regularities can then be exploited in order to solve any given task. Three 
different experiments are presented and their results lead to the conclusion that the 
linear combination of the one-step PI with an external reward function is not generally 
recommended in an episodic policy gradient setting. Only for hard tasks can a substantial 
speed-up be achieved, at the cost of a loss in asymptotic performance. 

Keywords: information-driven self-organization, predictive information, reinforcement learning, embodied artificial 
intelligence, embodied machine learning 



1. INTRODUCTION 

One of the main challenges in the field of embodied 
artificial intelligence (EAI) is the open-ended autonomous 
learning of complex behaviors. Our approach is to use task- 
independent, information-driven intrinsic motivation to support 
task-dependent learning in the context of reinforcement learning 
(RL) and EAI. The work presented here is a first step in this 
direction. RL is of growing importance in the field of EAI, mainly 
for two reasons. First, it allows the behaviors of high-dimensional 
and complex systems to be learned with simple objective functions. 
Second, it has a well-established theoretical (Sutton and 
Barto, 1998; Bellman, 2003) and biological foundation (Dayan 
and Balleine, 2002). In the context of EAI, where the agent has 
a morphology and is situated in an environment, the concepts of 
the agent's intrinsic and extrinsic perspective arise naturally. As a 
direct consequence, several questions about intrinsic and extrinsic 
reward functions, denoted by IRF and ERF, follow from the 
EAI point of view. The questions that are of interest to us are: 
what distinguishes an IRF from an ERF, what is a good candidate 
for a first principled IRF, and finally, how should IRFs and ERFs 
be combined? 

The first question, how to distinguish between IRF and ERF, 
is addressed in the second section of this work, which starts 
with the conceptual framework of the sensorimotor loop and its 
representation as a causal graph. This leads to a natural distinction 
of variables that are intrinsic and extrinsic to the agent. We 
define an IRF that models an internal drive or motivation as a 
task-independent function which operates on the agent's intrinsic 
variables only. In general, an ERF is a task-dependent function 
that may operate on intrinsic and extrinsic variables. 

The main focus of this work is the second question, which 
deals with finding a first principled IRF. We propose the predictive 
information (PI) (Bialek et al., 2001) for the following reasons. 
Information-driven self-organization, by means of maximizing 
the one-step approximation of the PI, has been shown to produce 
coordinated behavior among physically coupled but otherwise 
independent agents (Ay et al., 2008; Zahedi et al., 2010). The 
reason is that the PI inherently addresses two important issues 
of self-organized adaptation, as the following equation shows: 
$I(S_t; S_{t+1}) = H(S_{t+1}) - H(S_{t+1} \mid S_t)$, where $S_t$ is the vector of 
intrinsically accessible sensor values at time $t$. The first term leads 
to a diversity of the behavior, as every possible sensor state must 
be visited with equal probability. The second term ensures that the 
behavior is compliant with the constraints given by the environ- 
ment and the morphology, as the behavior must be predictable. 
This means that an agent maximizing the PI explores behavioral 
regularities, which can then be exploited to solve a task. In a dif- 
ferently motivated work, namely to obtain purely self-organizing 
behavior, a time-local version of the PI was successfully used to 
drive the learning process of a robot controller (Martius et al., 
2013). A similar learning rule was obtained from the principle 
of Homeokinesis (Der and Martius, 2012). In both cases, gradient 
information was derived to pursue local optimization. For the 
integration of external goals, a set of methods has been proposed 
by Martius and Herrmann (2012), which, however, cannot deal 
with the standard reinforcement setting of arbitrary time-delayed 
rewards that we study here. Prokopenko et al. (2006) used the 
PI, estimated on the spatio-temporal phase-space of an embodied 
system, as part of the fitness function in an artificial evolution setting. 
It was shown that the resulting locomotion behavior of a snake-bot 
was more robust compared to the setting in which only the 
traveled distance determined the fitness. 

The third question, which deals with how to combine the IRF 
and ERF, is the focus of the ongoing research that was briefly 
described above and of which this publication is a first step. As 
the PI maximization is considered to be an exploration of behavioral 
regularities, it would be natural to replace the exploration 
method of an RL algorithm by a gradient on the PI. The work 
presented here is a preliminary step in which we concentrate on the 
effect of the PI in an RL context to understand for which types of 
learning problems it is beneficial and for which it might not be. 
Therefore, we chose a linear combination of IRF and ERF in an 
episodic RL setting to evaluate the PI as an IRF in different exper- 
iments. Combining an IRF and an ERF in this way is justified as 
ERFs are often linear combinations of different terms, such as one 
term for fast locomotion and another for low energy consump- 
tion. Nevertheless, the results of the experiments presented in this 
work show that the one-step PI should not be combined in this 
way with an ERF in an episodic policy gradient setting. 

We are not the first to address the question of IRF and ERF 
in the context of RL and EAI. This idea goes back to the pioneering 
work of Schmidhuber (1990) and is also the focus 
of more recent work (Kaplan and Oudeyer, 2004; Schmidhuber, 
2006; Oudeyer et al., 2007), which is based on the prediction 
progress, and of Barto et al. (2004), who consider the prediction 
error. In Storck et al. (1995) and Yi et al. (2011), an intrinsic 
reward for information gain was proposed (KL divergence 
between subsequent models), which results in a state-entropy 
maximization in their experiments. A different approach (Little and 
Sommer, 2013) uses a greedy policy on the predicted information 
gain of the world model to select the next action of an 
agent. However, only discrete state/action spaces have been considered 
in both approaches. A similar work (Cuccu et al., 2011) 
uses compression quality as the intrinsic motivation, which was 
particularly beneficial because it performed a reduction of the 
high-dimensional visual input space. In comparison to our work, 
only one experiment (comparable to the self-rescue task below) 
with a one-dimensional action space was used, without considering 
asymptotic performance, which is where we found most 
problems. 

This paper investigates continuous-space, high-dimensional 
control problems where random exploration becomes difficult. 
The PI, measured on the sensor values, accompanies (and might 
eventually replace) the exploration of an RL method such that the 
policy adaptations are conducted in compliance with the morphology 
and environment. The actual embodiment is taken into account 
without modeling it explicitly in the learning process. 

The work is organized in the following way. The next section 
gives an overview of the methods, beginning with the sensorimotor 
loop and its causal representation. This is then followed by a 
presentation of the PI and the episodic RL method PGPE (Sehnke 
et al., 2010). The third section describes the results obtained by 
applying the methods to three experiments, and the last section 
closes with a discussion. 

2. METHODS 

This section describes the methods used in this work. It begins 
with the conceptual framework of the sensorimotor loop. This is 
then followed by a discussion of the PI and entropy, both of which 
are used as IRFs in all presented experiments. Finally, the RL 
algorithm utilized in this work is introduced as far as it is required to 
understand how the results were obtained. 

2.1. EMBODIED AGENTS AND THE SENSORIMOTOR LOOP 

There are three main reasons why we prefer to experiment with 
embodied agents (EA). First, scalability: EA are high-dimensional 
systems which live in a continuous world. Hence, the algorithms 
face the curse of dimensionality if they are evaluated on different 
EAs. Second, validation: we are interested in understanding natural 
cognitive systems by means of building artificial agents (Brooks, 
1991). Using EA ensures that the models are validated against 
the same (or similar) physical constraints that natural systems 
are exposed to. Third, guidance: there is good evidence that the 
constraints posed by the morphology and environment can be used 
to reduce the required controller complexity, and hence, reduce 
the size of the search space for a learning algorithm (Zahedi et al., 
2010; Pfeifer and Bongard, 2006). Consequently, understanding 
the interplay between the body, brain and environment, also called 
the sensorimotor loop (SML, see Figure 1), is a general focus of 
our work. The next paragraph will introduce the general concept 
of the SML and discuss its representation as a causal graph. 

A cognitive system consists of a brain or controller that sends 
signals to the system's actuators, which then affect the system's 
environment. We prefer the notion of the system's Umwelt (von 
Uexkuell, 1934; Clark, 1996; Zahedi et al., 2010; Zahedi and Ay, 
2013), which is the part of the system's environment that can be 
affected by the system, and which itself affects the system. The 
state of the actuators and the Umwelt are not directly accessible to 
the cognitive system, but the loop is closed as information about 
both the Umwelt and the actuators is provided to the controller 
by the system's sensors. In addition to this general concept, which 
is widely used in the EAI community (see e.g., Pfeifer et al., 2007), 
we introduce the notion of world to the sensorimotor loop, and by 
that we mean the system's morphology and the system's Umwelt. 
We can now distinguish between the agent's intrinsic and extrin- 
sic perspective in this context. The world is everything that is 
extrinsic from the perspective of the cognitive system, whereas 
the controller, sensor and actuator signals are intrinsic to the 
system. 

The distinction between intrinsic and extrinsic is also cap- 
tured in the representation of the sensorimotor loop as a causal or 
Bayesian graph (see Figure 1, right-hand side). The random vari- 
ables C, A, W, and S refer to the controller state, actuator signals, 
world and sensor signals, and the directed edges reflect causal 
dependencies between the random variables (see Klyubin et al., 
2004; Ay and Polani, 2008; Zahedi et al., 2010). Everything that is 
extrinsic to the system is captured in the variable W, whereas S, 
C, and A are intrinsic to the system. 
FIGURE 1 | The sensorimotor loop. Left: Schematic diagram of a cognitive system with its interaction with the world. Right: Corresponding causal graph. 



In this context, we distinguish between internal and external 
reward functions (IRF, ERF) in the following way. An ERF may 
access any variable, especially those that are not available to an 
agent through its sensors, i.e., anything that we summarized as the 
world state W. An IRF may access intrinsically available information 
only ($S_t$, $A_t$, $C_t$; see Figure 1). We are interested in a first 
principled model of an intrinsic motivation, i.e., a model that requires 
as few assumptions as possible. The idea is that an IRF should not 
depend on a specific task but rather be a task-independent inter- 
nal driving force, which supports any task-dependent learning. 
This is why we refer to it as task-independent internal motivation 
or drive. This closes the discussion of embodied agents and their 
formalization in terms of the sensorimotor loop. The next section 
describes the information-theoretic measures that are used in the 
remainder of this work. 

2.2. PREDICTIVE INFORMATION 

The predictive information (PI) (Bialek et al., 2001), which is 
also known as excess entropy (Crutchfield and Young, 1989) and 
effective measure complexity (Grassberger, 1986), is defined as the 
mutual information of the entire past and future of the sensor 
data stream: 

$$I_{\mathrm{pred}}(S) := I(S_p; S_f) \qquad (1)$$

where $S_p = \{S_1, S_2, \ldots, S_t\}$ is the entire past of the system's 
sensor data at some time $t \in \mathbb{N}$ and $S_f = \{S_{t+1}, S_{t+2}, \ldots\}$ its 
entire future. The PI captures how much information the past 
carries about the future. Unfortunately, it cannot be calculated 
for most applications because of technical reasons. Hence, we use 
the one-step PI, which is given by 

$$\hat{I}_{\mathrm{pred}}(S) := I(S_{t+1}; S_t) = \underbrace{H(S_{t+1})}_{\text{diversity}} - \underbrace{H(S_{t+1} \mid S_t)}_{\text{compliance}}, \qquad (2)$$

which was previously investigated in the context of EAI (Ay 
et al., 2008) and as a first principle learning rule (Zahedi et al., 
2010; Martius et al., 2013). A different motivation for the PI 
is based on maximizing the mutual information of an intention 
state $S_t$, which is internally generated by the agent, and the 
next sensor state $S_{t+1}$ (Ay and Zahedi, 2013). Equation (2) 
displays how maximizing the PI affects the behavior of a 
system. The first term in Equation (2) leads to a maximization 
of the entropy over the sensor states. This means that the 
agent has to explore its world in order to sense every state with 
equal probability. The second term in Equation (2) states that 
the uncertainty of the next sensor state must be minimal if the 
current sensor state is known. This means that an agent has 
to choose actions which lead to predictable next sensor states. 
This can be rephrased in the following way. Maximizing the 
entropy $H(S_{t+1})$ increases the diversity of the behavior, whereas 
minimizing the conditional entropy $H(S_{t+1} \mid S_t)$ increases the 
compliance of the behavior. The result is a system that explores 
its sensor space to find as many regularities in its behavior as 
possible. 

For completeness, we also maximize only the entropy $H(S_t)$ 
and compare the results to the maximization of the PI. This 
concludes the presentation of the PI (and entropy) as a model for 
a task-independent internal motivation in the context of RL. The 
next section presents the utilized RL algorithm. 
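
As a rough illustration of Equation (2), the following Python sketch estimates the one-step PI from a discretized sensor sequence with a simple plug-in (frequency-count) estimator. The estimator, the function names `entropy` and `one_step_pi`, and the use of NumPy are assumptions of this sketch and are not taken from the experiments reported here.

```python
import numpy as np


def entropy(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))


def one_step_pi(symbols, n_states):
    """Plug-in estimate of I(S_{t+1}; S_t) = H(S_{t+1}) - H(S_{t+1} | S_t).

    symbols  : sequence of integer sensor states in {0, ..., n_states - 1}
    n_states : number of discrete sensor states (assumed known)
    Returns (pi, diversity term, compliance term).
    """
    s = np.asarray(symbols)
    joint = np.zeros((n_states, n_states))
    # joint[i, j] counts the observed transitions S_t = i -> S_{t+1} = j
    np.add.at(joint, (s[:-1], s[1:]), 1)
    h_next = entropy(joint.sum(axis=0))                       # H(S_{t+1})
    h_cond = entropy(joint.ravel()) - entropy(joint.sum(axis=1))  # H(S_{t+1} | S_t)
    return h_next - h_cond, h_next, h_cond
```

For a strictly periodic sequence over four states, the conditional entropy vanishes and the estimate approaches $\log_2 4 = 2$ bits; for a constant sequence, both terms are zero, which matches the reading of the two terms as diversity and compliance.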



2.3. POLICY GRADIENTS WITH PARAMETER-BASED EXPLORATION 
(PGPE) 

We chose an episodic RL method named PGPE (Sehnke et al., 
2010) to investigate the effect of the PI as an IRF, because it is 
not restricted to a specific class of policies. Any policy that 
can be represented by a vector $\mu \in \mathbb{R}^n$ with fixed length 
$n \in \mathbb{N}^+$ can be optimized by this method. In the work presented 
here, we use it only to learn the synaptic strengths and bias 
values of neural networks with fixed structures. Nevertheless, 
we can apply the framework to other parametrizations, in 
particular to stochastic policies, which is why PGPE attracted 
our attention for the ongoing project in which this work is 
embedded. 

The algorithm can be summarized in the following way (for 
details, see Sehnke et al., 2010). In each roll-out or episode, 
two policy instances are drawn from $\mu$ by adding and 
subtracting a random vector $\epsilon \sim \mathcal{N}(0, \sigma)$. The resulting two 
policy parametrizations $\theta^+ = \mu + \epsilon$ and $\theta^- = \mu - \epsilon$ are then 
evaluated, and their final rewards $r^+$, $r^-$ are used to determine the 
modifications of $\mu$ and $\sigma$ according to the following equations: 



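$$m^{n} = \max\left(m^{n-1}, r^{+,n}, r^{-,n}\right) \qquad (3)$$

$$b^{n} = (1 - \delta)\, b^{n-1} + \delta\, \frac{r^{+,n} + r^{-,n}}{2} \qquad (4)$$

$$\Delta\mu_i = \frac{\alpha\, \epsilon_i\, (r^+ - r^-)}{2m - r^+ - r^-} \qquad (5)$$

$$\Delta\sigma_i = \frac{\alpha}{m - b} \left( \frac{r^+ + r^-}{2} - b \right) \left( \frac{\epsilon_i^2 - \sigma_i^2}{\sigma_i} \right) \qquad (6)$$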



Roll-outs can be repeated several times before a learning step is 
performed. Every learning step concludes a batch. PGPE requires 
an initial $\mu_{\mathrm{init}}$, an initial $\sigma_{\mathrm{init}}$, a learning rate $\alpha$, a baseline $b$, a baseline 
adaptation parameter $\delta$, and an initial maximal reward 
$m = m_{\mathrm{init}}$. We set $\delta$ to the recommended value of 0.1 and $\mu_{\mathrm{init}} = 0$, 
and we achieved the best results in all experiments by setting 
$m_{\mathrm{init}}$ small enough that $m$ is definitely overwritten in the first 
roll-out (see Equation 3). The other parameters were tuned 
in each experiment such that the best results were achieved when 
no IRF was used, and were then fixed for the remaining experiments. 
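
The update described above can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the implementation used for the experiments: a single symmetric roll-out per update, the same learning rate $\alpha$ for $\mu$ and $\sigma$, a baseline that is a moving average of the mean of $r^+$ and $r^-$, and guards that skip an update when a denominator would be zero or when the mean reward does not exceed the baseline. The function name `pgpe_step` and the callable `evaluate` are hypothetical.

```python
import numpy as np


def pgpe_step(mu, sigma, evaluate, m, b, alpha=0.1, delta=0.1, rng=None):
    """One symmetric-sampling PGPE update (sketch).

    mu, sigma : arrays of policy means and exploration widths
    evaluate  : hypothetical callable mapping a parameter vector to its episode reward
    m, b      : current maximal reward and baseline
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(0.0, sigma)                 # one perturbation per parameter
    r_plus = evaluate(mu + eps)                  # reward of theta^+
    r_minus = evaluate(mu - eps)                 # reward of theta^-
    m = max(m, r_plus, r_minus)                  # maximal reward seen so far
    r_mean = 0.5 * (r_plus + r_minus)

    denom = 2.0 * m - r_plus - r_minus
    if denom > 0.0:                              # guard added in this sketch
        mu = mu + alpha * eps * (r_plus - r_minus) / denom

    if m - b > 0.0 and r_mean > b:               # guard added in this sketch
        sigma = sigma + alpha / (m - b) * (r_mean - b) * (eps**2 - sigma**2) / sigma

    b = (1.0 - delta) * b + delta * r_mean       # assumed moving-average baseline
    return mu, sigma, m, b
```

Starting with `m = -np.inf` ensures that the maximal reward is overwritten by the first roll-out, as described in the text. In an episodic setting, `evaluate` would run the controller for $T$ time steps and return the accumulated reward.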

3. RESULTS 

This section presents three different experiments and their results. 
The first experiment is the cart-pole swing-up, a standard con- 
trol theory problem that is also widely used in machine learning 
(Barto et al., 1983; Geva and Sitte, 1993; Doya, 2000; Pasemann 
et al., 1999). The cart-pole experiment is also chosen because bal- 
ancing a pole minimizes the entropy, and hence, it contradicts 
the maximization of the PI. The second experiment is the learn- 
ing of a locomotion behavior for a hexapod and it was chosen to 
demonstrate the effect of the PI maximization on a more com- 
mon, well-structured experimental setting. By well-structured we 
mean that the controller, morphology, environment, and ERF 
are chosen such that they result in a good hexapod locomotion 
without any additional support by an IRF in only a few policy 
updates. The third experiment is designed to be challenging, as it 
combines a high-dimensional system, an unconventional control 
structure, an unsteady ERF with an unsteady environment. We 
believe that these three experiments span a broad range of pos- 
sible applications for information-theoretic IRF in the context of 
episodic RL. 

3.1. CART-POLE SWING-UP 

The cart-pole swing-up experiment is ideal to investigate the 
effect of the PI on an episodic RL task, mainly for two reasons. 
First, the experiment is well-defined by a set of equations and 
parameters that are widely used in literature (Barto et al., 1983; 
Geva and Sitte, 1993; Doya, 2000; Pasemann et al., 1999). This 
ensures that the results are comparable and reproducible by others 
with little effort. Second, the successful execution of the task 
contradicts the maximization of the PI. The task is to balance the 
pole in the center of the environment, and hence, to minimize the 
entropy of the sensor states. The maximization of the PI demands 
a maximization of the entropy (see Equation 2). The remainder of 
this section first describes the experimental and controller setting 
and then closes with a discussion of the results. 

The experiment was conducted by implementing the equations 
that can be found in Barto et al. (1983), Geva and Sitte (1993), and 
Doya (2000). The state of the cart-pole is given by $x$, $\dot{x}$, $\vartheta$, $\dot{\vartheta}$, which 
are the position of the cart, the speed of the cart, the pole angle, 
and the pole's angular velocity. The cart is controlled by a force 
$F \in [-10\,\mathrm{N}, 10\,\mathrm{N}]$ that is applied to its center of mass. The four 
state variables and the force define the input and output 
configuration of our controllers for this task. The initial controller (see 
Figure 2A) was chosen from Pasemann et al. (1999), where network 
structures were evolved for the same task. To ensure that 
the evolved structure is not especially unsuitable for RL, different 
variations were chosen for evaluation too (see Figures 2B-D). In 
this approach, the input neurons are simple buffer neurons, with 
the identity as transfer-function, whereas all other neurons use 
the hyperbolic tangent transfer-function. 

The evaluation time was set to $T = 2000$ iterations, which 
corresponds to 20 seconds (cf. Doya, 2000). Different values, starting 
from the values proposed in Sehnke et al. (2010), for the learning 
rate $\alpha \in \{0.1, 0.2, 0.5\}$, the initial variation $\sigma_{\mathrm{init}} \in \{2, 5\}$, and the 
initial maximal reward $m_{\mathrm{init}} \in \{-\infty, 10, 100, 1000\}$ were evaluated 
in experiments without applying an IRF to the learning of the 
task. The underlined values showed the best results, and hence, 
are chosen for presentation here. Each experiment consisted of 
$B = 10000$ batches, i.e., updates of $\mu$ and $\sigma$ (see Equations 5 and 
6) with two roll-outs each (i.e., four evaluated policies $\theta_{1,2}^{\pm}$). 
The results are obtained by conducting every experiment 100 
times. To ensure comparability among the experiments with dif- 
ferent parameters and controllers, the random number generator 
was initialized from a fixed set of 100 integer values for each 
experiment. 

The presentation of the reward function is split into two parts. 
The first part handles the ERF, whereas the second part handles 
the IRF. We use the terms intrinsic/internal and extrinsic/external 
with respect to the agent's perspective, as discussed in the previous 
section (see Section 2.1). The controller has access to the full state 
of the system, and hence, the separation into internal and external 
is artificial in this case. Nevertheless, we keep this terminology for 
consistency, as the next experiments will reflect this distinction 
in a natural way. We denote the IRF by $R_{\mathrm{in}}$ and the ERF by $R_{\mathrm{ex}}$, where a 
superscript is added to distinguish between the different reward 
functions (PI and entropy). 

The ERF for the cart-pole swing-up task is defined such that it 
does not provide a smooth gradient in the reward space and, 
therefore, does not directly guide the learning process. The controller is 
only rewarded if the pole is pointing upwards, and the reward is scaled 
with the distance of the pole to the center of the environment. 
It is given by 



$$R_{\mathrm{ex}}(t) := \begin{cases} 2 - |x(t)| & \text{if } |\vartheta(t)| < 5^{\circ} \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$



The IRF is calculated at the end of each episode based on 
the recordings of the pole angles $\{S_t = \vartheta(t) \mid t = 1, 2, \ldots, T\}$. 
FIGURE 2 | Controller architectures for the cart-pole swing-up 
task. The input neurons are bare buffer neurons whereas the 
hidden and output neurons have a tanh transfer function. (A) from 
Pasemann et al. (1999); (B) with 4 hidden neurons and fully 
connected; (C,D) recurrent variations without and with lateral 
connections. 



We use a discrete-valued computation of the PI, and hence, the 
data is binned prior to the calculation. All IRFs are normalized 
with respect to their theoretical upper bound, $I(S_{t+1}; S_t) \le 
H(S_t) \le \log|\mathcal{S}|$ (see Cover and Thomas, 2006). This leads to the 
two following IRFs: 

$$R_{\mathrm{in}}^{\mathrm{PI}} := \lVert I(S_{t+1}; S_t) \rVert \quad \text{and} \quad R_{\mathrm{in}}^{H} := \lVert H(S_t) \rVert. \qquad (8)$$

The overall reward functions are then given by 

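$$R^{\mathrm{PI}} := \sum_{t=1}^{T} R_{\mathrm{ex}}(t) + \beta(\gamma)\, R_{\mathrm{in}}^{\mathrm{PI}},$$

$$R^{H} := \sum_{t=1}^{T} R_{\mathrm{ex}}(t) + \beta(\gamma)\, R_{\mathrm{in}}^{H}, \qquad \beta(\gamma) = \gamma \cdot T \cdot \max_{x, \vartheta, t} \{R_{\mathrm{ex}}(t)\}, \qquad (9)$$

where $\beta(\gamma)$ is a factor to scale the IRF with respect to the maximal 
possible value of the ERF. This allows us to compare the effects of 
$R_{\mathrm{in}}^{\mathrm{PI}}$ and $R_{\mathrm{in}}^{H}$ across different experiments. 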
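
To make the scaling in Equation (9) concrete, the following Python sketch combines the per-step ERF of Equation (7) with a normalized IRF value. It assumes that $\gamma$ is given as a fraction (5% corresponds to 0.05), that the pole angle is recorded in degrees, and that the maximal per-step ERF is 2 (cart at the center, pole upright), which follows from Equation (7). The function names are hypothetical.

```python
import numpy as np


def external_reward(x, angle_deg):
    """Per-step ERF: 2 - |x| while the pole is within 5 degrees of upright, else 0."""
    x = np.asarray(x, dtype=float)
    angle_deg = np.asarray(angle_deg, dtype=float)
    return np.where(np.abs(angle_deg) < 5.0, 2.0 - np.abs(x), 0.0)


def beta(gamma, T, max_step_reward=2.0):
    """Scaling factor gamma * T * max R_ex(t); gamma is a fraction, e.g. 0.05."""
    return gamma * T * max_step_reward


def episode_reward(x, angle_deg, r_in, gamma):
    """Overall episode reward: summed ERF plus the scaled, normalized IRF r_in in [0, 1]."""
    r_ex = external_reward(x, angle_deg)
    return r_ex.sum() + beta(gamma, len(r_ex)) * r_in
```

With $T = 2000$ and $\gamma = 5\%$, the IRF can contribute at most 200 to the episode reward, compared to a maximal summed ERF of 4000.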

The results are discussed in detail only for the fully connected 
feed-forward network (see Figures 3A-D), as this controller 
shows the most distinguishable results with respect to the 
variation of the IRF scaling parameter $\gamma \in \{0\%, 1.25\%, 2.5\%, 3.75\%, 5\%\}$. 
It is important to note that the plots only show the averages of the 
100 experiments and not the standard deviation for the following 
reason. Few controllers succeed early, others later during the pro- 
cess. Due to the unsteady ERF the resulting standard deviation 
is very large, as those controllers that succeed receive signifi- 
cantly higher reward compared to those not succeeding (which 
remain close to zero, as a rotational behavior is not permitted). 
We intentionally chose an unsteady ERF, that returns zero for 
almost all states, and hence, we know beforehand that the stan- 
dard deviation is large and no further information is provided if 
it is plotted. 

Figures 3A,B show the progress of the ERF $R_{\mathrm{ex}}$ and IRF $R_{\mathrm{in}}^{\mathrm{PI}}$ 
for the PI maximization. It is shown that there is a significant 
speed-up in learning during the first 4000 batches for all $\gamma > 0\%$ 
(see Figure 3A). At this point in time, the average ERF of $\gamma = 0\%$ 
exceeds that of $\gamma = 5\%$. After approximately 5000 batches, the 
ERFs for $\gamma = 2.5\%$ and $\gamma = 3.75\%$ are very close to or slightly 
exceeded by the ERF for $\gamma = 0\%$, whereas the ERF for 
$\gamma = 1.25\%$ remains higher. The conclusion from this experiment is 
that small values of $\gamma < 5\%$ are beneficial in this learning task, 
as fewer batches are required to solve this task and the asymptotic 
learning performances are almost identical to $\gamma = 0\%$. The 
results, however, are not significant and the choice of $\gamma$ is critical. 
This leads to the conclusion that the one-step PI is not 
significantly beneficial in the learning of this task. 

Figures 3C,D show the progress of the ERF $R_{\mathrm{ex}}$ and IRF $R_{\mathrm{in}}^{H}$ 
for the entropy maximization. The results show a different picture. 
Any parameter $\gamma > 0\%$ speeds up the learning and improves 
the overall performance. The comparison of entropy and PI is 
addressed again in the discussion. 

3.2. HEXAPOD LOCOMOTION 

If a specific task should be learned by an embodied agent, it 
is more common to choose an environment, morphology, con- 
trol structure and a smooth ERF which are well-suited for the 
desired task. In order to investigate which effect the PI has on 
such a well-defined learning task, the set-up of the experiment 
presented in this section is chosen such that all components 
are known to work well if there is no IRF present. The goal 
is to learn a locomotion behavior of a hexapod, where the 
maximal deviation angles ensure that it cannot flip over. The 
controller is known to perform well in a similar task (Markelic 
and Zahedi, 2007) and its modularity significantly reduces the 
number of parameters that must be learned. The ERF defines 
a smooth gradient in the reward space, ensuring that small 
changes in the controller parameters show an immediate effect 
in the ERF. The environment is an even plane without any 
obstacles. 

The experimental platform (see Figure 4) is a hexapod, with 12 
degrees of freedom (two actuators in each leg) and with 18 sensors 
(angular positions of the actuators and binary foot contact sen- 
sors). The two actuators of each leg are positioned in the shoulder 
(Thorax-Coxa or ThC joint) and in the knee (Femur-Tibia or 
FTi joint) of the walking machine, similar to the morphology 
presented in (von Twickel et al., 2011). We omit the second 
shoulder-joint (CTr) because it is not required for locomotion. 
Each joint accepts the desired angular position as its input and 
returns the actual current angular position as its output. The 
simulator YARS (Zahedi et al., 2008) was used for all experiments 
conducted in this section. 
FIGURE 3 | Results for cart-pole experiments. Each row shows the 
results for one controller architecture, see Figure 2. The 
corresponding connection matrix is provided in the first column 
(gray: connection, black: no connection). For simplicity only the row 
for the second controller is discussed in detail. (A,B) ERF and IRF 
for PI maximization: small values of $\gamma > 0$ are advantageous. 
(C,D) ERF and IRF for entropy maximization: all values of $\gamma > 0$ 
have a positive effect. 




Different values for the PGPE parameters were evaluated. The 
best results for $\gamma = 0$ (see Equation 9) were achieved with 
$\sigma_{\mathrm{init}} = 2$ and $\alpha = 0.1$. To ensure comparability with the previous 
experiment, two roll-outs were chosen here, although this is not required 
to obtain the following results. The evaluation time was set to 
$T = 1000$, and $B = 250$ batches were sufficient to observe a 
convergence of the policy parameters $\mu$. The values for $\gamma$ were chosen 
from the previous experiment. 

The ERF is calculated once at the end of each episode and 
is defined as the Euclidean distance between the hexapod 
at time $T$ and its initial position $(0, 0)$, projected onto the 
xy-plane: 

$$R_{\mathrm{ex}} := \sqrt{x_T^2 + y_T^2}, \qquad (10)$$

where $(x_T, y_T)$ are the coordinates of the center of the robot in 
world coordinates at time $t = T$. 

The IRF is calculated differently compared to the previous 
experiment. In a high-dimensional system such as the hexapod, it is 
not possible to compute the PI of the entire system with 
reasonable effort, as the computational cost of $I(S_t; S_{t+1})$ grows 
exponentially with every new sensor. It would be natural to reduce 
the computational cost by calculating the PI based on a model 
of the morphology, but this would violate our claim that the 
PI incorporates the morphology without the need of explicitly 
modeling it. Hence, we decided to use the following method 
to approximate the PI and the entropy $H$ (see Figure 4D). Let 
$S_i(t)$, $i = 1, 2, \ldots, 12$, be the angular position sensors for the 12 
actuators. We then chose two sensors $k, l$ with $1 \le k, l \le 12$, $k \ne l$, 
randomly from the 12 possible sensors, and calculated 

$$\mathrm{PI}_u := I\big(S_k(t+1), S_l(t+1);\, S_k(t), S_l(t)\big), \qquad H_u := H\big(S_k(t), S_l(t)\big). \qquad (11)$$

The overall PI and entropy are then calculated as the sum of $n$ 
randomly chosen $\mathrm{PI}_u$ and $H_u$ pairings, with the additional constraint 
that each sensor pair $k, l$ appears only once in the approximations. 
The resulting IRFs are then given by: 

$$R_{\mathrm{in}}^{\mathrm{PI}} := \sum_{u=1}^{n} \mathrm{PI}_u \quad \text{and} \quad R_{\mathrm{in}}^{H} := \sum_{u=1}^{n} H_u, \qquad (12)$$

where $n$ is the number of pairings. For $n > 20$ no difference was 
found for the approximated PI, which is why $n = 20$ was chosen 
for the remainder of this work. 
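
The pairwise approximation of Equations (11) and (12) can be sketched as follows. The equal-width binning with `n_bins` bins per sensor, the plug-in entropy estimator, and all function names are assumptions of this sketch; the text above only specifies that sensor pairs are drawn at random without repetition and that the per-pair values are summed.

```python
import itertools

import numpy as np


def _entropy(labels):
    """Plug-in Shannon entropy in bits of a sequence of integer labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def pairwise_pi_and_entropy(sensors, n_pairs=20, n_bins=10, rng=None):
    """Sum of PI_u and H_u over n_pairs distinct random sensor pairs.

    sensors : array of shape (T, n_sensors) with recorded sensor values
    Returns (sum of PI_u, sum of H_u).
    """
    rng = np.random.default_rng() if rng is None else rng
    sensors = np.asarray(sensors, dtype=float)
    # Equal-width binning per sensor (assumed discretization scheme)
    lo, hi = sensors.min(axis=0), sensors.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    binned = np.minimum(((sensors - lo) / span * n_bins).astype(int), n_bins - 1)

    pairs = list(itertools.combinations(range(sensors.shape[1]), 2))
    chosen = rng.choice(len(pairs), size=n_pairs, replace=False)  # each pair at most once

    pi_sum, h_sum = 0.0, 0.0
    for idx in chosen:
        k, l = pairs[idx]
        joint = binned[:, k] * n_bins + binned[:, l]   # (S_k, S_l) encoded as one symbol
        now, nxt = joint[:-1], joint[1:]
        h_both = _entropy(now * n_bins**2 + nxt)       # H of (now, next) jointly
        pi_sum += _entropy(now) + _entropy(nxt) - h_both
        h_sum += _entropy(joint)
    return pi_sum, h_sum
```

With 12 sensors there are 66 distinct pairs, so drawing $n = 20$ pairs without replacement is always possible.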

The overall reward functions are then given by: 

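$$R^{\mathrm{PI}} := R_{\mathrm{ex}} + \beta(\gamma)\, R_{\mathrm{in}}^{\mathrm{PI}}, \qquad R^{H} := R_{\mathrm{ex}} + \beta(\gamma)\, R_{\mathrm{in}}^{H}, \qquad (13)$$

where $\beta(\gamma)$ is defined as in the cart-pole swing-up experiment (see 
Equation 9). 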

A common recurrent neural network central pattern generator 
layout is chosen, which can also be found in the literature (e.g., 
Campos et al., 2010; von Twickel et al., 2011; Markelic and Zahedi, 
2007), thereby using the same neuron model as in the cart-pole 
experiment (see above). As all legs in the hexapod are morphologically 
equivalent, only the synaptic weights of one leg controller 
are open to parameter adaptation in the PGPE algorithm. The 
values are then copied to the other leg controllers. This reduces 
the number of parameters for the entire controller to 32 (see 
Figures 4B,C). 



The results (see Figure 5) show that neither the PI nor the 
entropy has a noticeable effect on the learning performance. The 
mean values of the 100 experiments for each parameter as well 
as the standard deviation are almost identical. This point will be 
addressed in the discussion of this work (see Section 4). 

3.3. HEXAPOD SELF-RESCUE 

The third experiment is designed to combine and extend the two 
previous experiments. It combines them as a high-dimensional 
morphology, similar to that used in the locomotion experiment, 
is trained with an unsteady ERF, which is similar to that used 
in the cart-pole experiment. It extends the previous experiments 
as the number of parameters in the controller is an order of magnitude 
larger and because an unconventional control structure is used for 
the desired task. The most distinctive difference to the previous 
experiments is the non-trivial environment. The next paragraphs 
will explain the experimental set-up in detail before the section 
closes with a discussion of the results. 

We used the simulated hexapod robot of the LpzRobots simulator
(Martius et al., 2012). The hexapod has 12 active and 16
passive degrees of freedom (see Figure 6). The active joints take
the desired next angular position as their input and deliver the
current actual angular position as their output. The controller is
a fully connected one-layer feed-forward neural network without
lateral connections and with the hyperbolic tangent transfer function
$a_{t+1} = \tanh(W s_t + v)$, where $a_{t+1}$ and $s_t$ are the next
action and the current sensor values, $W$ is the connection matrix,
and $v$ is the vector of biases. The resulting controller is
parameterized by 156 parameters, 144 for the synaptic weights and
12 for the bias values. Note that the controller is generic and has
no a priori structuring or other robot-specific details.
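The controller described above can be sketched in a few lines. This is an illustrative sketch only: the 12 x 12 shape of the weight matrix is inferred from the 144 weights and 12 biases stated in the text, while the flat parameter layout and all names are assumptions made here.

```python
import numpy as np

NUM_ACTIONS = 12  # one target angle per active joint
NUM_SENSORS = 12  # inferred from 144 weights / 12 actions


def unpack(theta):
    """Split a flat vector of 156 parameters into W (12 x 12) and v (12)."""
    num_weights = NUM_ACTIONS * NUM_SENSORS
    W = theta[:num_weights].reshape(NUM_ACTIONS, NUM_SENSORS)
    v = theta[num_weights:]
    return W, v


def controller(theta, s_t):
    """Compute the next action a_{t+1} = tanh(W s_t + v)."""
    W, v = unpack(theta)
    return np.tanh(W @ s_t + v)


# Example with arbitrary (hypothetical) parameters and sensor values.
rng = np.random.default_rng(0)
theta = rng.normal(size=NUM_ACTIONS * NUM_SENSORS + NUM_ACTIONS)
a_next = controller(theta, rng.uniform(-1.0, 1.0, size=NUM_SENSORS))
```

A flat parameter vector is convenient here because a parameter-exploring method such as PGPE perturbs all 156 values jointly, without needing to know the network structure.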

The task for the hexapod is to rescue itself from a trap. For this 
purpose, it is placed in a closed rectangular arena (see Figure 7). 
The difficulty of the task is determined by the height of the
arena's walls, denoted by $h \in \{0.0, 0.1, 0.2\}$ m (see Figure 6).
For comparison, the length of the lower leg (up to the knees) is
0.45 m. The relative sizes of the robot and the trap can be seen
in Figure 6B.

The ERF is given by 

$$R_{ex} := \begin{cases} \sqrt{x_T^2 + y_T^2} - r & \text{if } \sqrt{x_T^2 + y_T^2} - r > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (14)$$




FIGURE 5 | Results for hexapod locomotion task. ERF and IRF with PI maximization (A,B) and entropy maximization (C,D). No significant effect is observed.

FIGURE 6 | Hexapod robot for self-rescue and the experimental
setup. (A) The robot has 6 legs, where the hind legs are 10%
larger than the other legs. Each leg has two active DoF in the hip
joint and one passive DoF in both the knee and the ankle joint,
each equipped with a spring. Additionally, each whisker has two
spring joints. (B) The robot starts in the center of the trap with a
certain barrier height and has to escape from it. The reward is the
distance from the outside of the trap or zero otherwise.




FIGURE 7 | Performance in the self-rescue task depending on the
internal reward type and factor $\gamma$. Plotted are the ERF and the IRF in the case
of PI (A,B,E,F,I,J) and entropy (C,D,G,H,K,L) over the number of batches for
different values of $\gamma$ and barrier heights h: (A-D) no barrier (h = 0), (E-H) low
barrier (h = 0.1) and (I-L) high barrier (h = 0.2). For each value of $\gamma$ the mean
and standard deviation of 30 experiments are displayed. In all cases a
speed-up in learning is achieved with the IRF; however, the asymptotic
performance is worse.



where r is the radius of the trap (Figure 6) and $(x_T, y_T)$ is the
position of the center of the robot in world coordinates at the end of a
roll-out ($t = T$). The IRFs and overall reward functions are identical
to those used in the previous experiment (see Equations (11)
and (12)).
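The external reward can be illustrated with a short sketch. It reads Equation 14 as the Euclidean distance of the robot's final position from the origin minus the trap radius, clipped at zero, which matches the description in the caption of Figure 6. The assumption that the trap is centered at the origin, and the function name, are made here for illustration.

```python
import math


def external_reward(x_T, y_T, r):
    """Distance beyond the trap radius at the end of a roll-out, else zero.

    Assumes the trap is centered at the origin of the world coordinates.
    """
    distance_outside = math.hypot(x_T, y_T) - r
    return distance_outside if distance_outside > 0 else 0.0


# A robot that ends 5 m from the center of a trap with radius 2 m
# has covered 3 m outside the trap.
reward_escaped = external_reward(3.0, 4.0, 2.0)

# A robot that ends inside the trap receives no reward at all.
reward_trapped = external_reward(1.0, 0.0, 2.0)
```

The second call shows why this ERF is sparse: every policy that fails to leave the trap receives exactly the same reward of zero, so the external reward alone gives no gradient information during the early batches.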

As before, the performance of PGPE with $\gamma = 0$ was evaluated for
different values of $\sigma_{init}$ and $\alpha$, and the best values,
$\sigma_{init} = 2$ and $\alpha = 0.5$, were chosen for presentation
here. A different learning rate $\alpha_\sigma = 0.05$ was chosen for
the update of $\sigma$ (see Equation 3). Each episode consisted of
T = 1250 iterations (25 s) with one roll-out per episode. A total of
B = 5000, 7000, and 35000 batches were conducted for the different
heights h.



We compare the performance for different values of the IRF
factor $\gamma \in \{0, 0.05, 1, 5, 25\}\%$ and performed 30 experiments
for each setting. Figure 7 displays the results. As for the cart-pole
experiment, the plots for the PI and entropy in Figure 7 show a
clear picture of an exploration phase (high value) followed by an
exploitation phase (lower value).

To compare the results, we set two threshold values at $R_{ex} = 5$
and $R_{ex} = 20$, which correspond to distances of 5 m and 20 m between
the hexapod and the walls of the arena. The first threshold reflects
successful learning of the task, because it means that the hexapod
reliably escapes the arena. The second threshold represents the
case in which a high locomotion speed is additionally achieved
after a successful escape. For simplicity of argumentation,
we compare two cases, i.e., $\gamma = 0\%$ and $\gamma = 1\%$. If there is no
wall (h = 0 m), the system with IRF $\gamma = 1\%$ requires only half
the number of batches compared to no IRF (250 batches vs. 500
batches, see Figures 7A,C). For the arena with a medium height
(h = 0.1 m), the speed-up in reaching learning success increases to a
factor of approximately three (350 batches vs. 1100 batches, see Figures 7E,F).
The difference is most pronounced for the arena with high walls (h = 0.2 m),
as the system with IRF requires about 1000 batches on average
compared to the 5000 batches on average that are required by the
systems without IRF (see Figures 7I,K).
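The speed-up factors quoted above follow directly from the reported batch counts. The sketch below recomputes them and also shows one plausible way to read a batch count off a learning curve. The threshold criterion (the first batch whose mean ERF reaches the threshold) is an assumption made here; the text does not state how the counts were extracted.

```python
def batches_to_threshold(mean_rewards, threshold):
    """Return the 1-based index of the first batch reaching the threshold.

    Returns None if the threshold is never reached.
    """
    for batch_index, reward in enumerate(mean_rewards, start=1):
        if reward >= threshold:
            return batch_index
    return None


def speed_up(batches_without_irf, batches_with_irf):
    """Ratio of batches needed without the IRF to batches needed with it."""
    return batches_without_irf / batches_with_irf


# Batch counts as reported in the text: (without IRF, with IRF at 1%).
reported = {0.0: (500, 250), 0.1: (1100, 350), 0.2: (5000, 1000)}
ratios = {h: speed_up(*counts) for h, counts in reported.items()}
```

The resulting ratios are 2 for h = 0 m, roughly 3.1 for h = 0.1 m, and 5 for h = 0.2 m, so the benefit of the IRF in the early phase grows with the barrier height.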

This leads to the conclusion that both PI and entropy are
beneficial if short-term learning success is of primary interest.
However, the asymptotic learning success of those hexapods with
IRF is either equal to or lower than that of those without an IRF
in all experiments. This holds for the one-step PI and for the
entropy. Thus, both are not necessarily beneficial if the long-term,
asymptotic learning performance in an episodic policy gradient
setting is important.

4. DISCUSSION 

This paper discussed the one-step PI (Bialek et al., 2001) as an
information-driven intrinsic reward in the context of an episodic
policy gradient method. The reward is considered to be intrinsic,
because it is task-independent and relies only on the information
of the sensors of an agent, which, by definition, represent the
agent's intrinsic view of the world. We chose the maximization
of the one-step PI as an IRF, because it has proved to encourage
behaviors which show properties of morphological computation
without the need to model the morphology (Zahedi et al., 2010).

The IRF was linearly combined with a task-dependent ERF in
an episodic RL setting. Specifically, PGPE (Sehnke et al., 2010)
was chosen as the RL method, because it allows arbitrary
policy parametrizations to be learned. Within this set-up, three
different types of experiments were performed. The following
paragraphs summarize the results before they are discussed.

The first experiment was the learning of the cart-pole swing-up
task. Four controllers were evaluated, of which three were less
successful and one showed good results. The ERF was designed
to be difficult to maximize without the IRF, and the task
contradicted the maximization of the entropy and PI. The best
controller did not show a significant improvement of the learning
performance with respect to its asymptotic behavior. An improvement
could only be observed during the first learning steps.
Moreover, the choice of the linear combination factor $\gamma$ is
critical. For all controllers, a minor and not significant improvement
is observable. In the case of entropy maximization, any factor
$\gamma > 0\%$ showed an improvement in learning speed and learning
performance.

A locomotion behavior was learned for a hexapod in the sec- 
ond experiment. The entire set-up used well-known components 
for the environment, modular controller, ERF, and morphology 
so that the task was solved without IRF in only a few learning 
steps. No effect of the PI and entropy was observed. 

The third experiment combined the previous two and
extended them with a non-trivial environment. A hexapod had to
escape from a trap and was only rewarded outside of it. The
results showed no significant difference between the PI and the
entropy as IRFs. The learning speed was significantly improved
by both IRFs with increasing difficulty of the task. The asymptotic
performance was either equal or worse when an IRF was
introduced.

The hexapod locomotion experiment teaches us that the
information-theoretic reward functions (PI and entropy) have no
effect in well-defined experimental set-ups.

The cart-pole and the hexapod self-rescue experiments teach
us that the maximal values of the IRF should be around one percent
of the maximal ERF value to improve the learning speed and
learning performance in the short term. The asymptotic behavior
is either unaffected or negatively affected by the one-step PI. The
cart-pole experiment indicates that maximizing the entropy is
superior to maximizing the PI, whereas the hexapod self-rescue
experiment does not show such a clear picture. The success of the
entropy in both experiments is explained by the ERFs. Due to their nature,
random changes in the policy parameters are unlikely to result in
changes in the ERF during the first batches. Hence, maximizing
the entropy results in an exploration until the ERF is triggered.

The PI, defined as the entropy over the sensor states minus
the conditional entropy of consecutive sensor states, does not
yield superior results for the cart-pole compared to just using
the entropy for the following reason. In this set-up, the morphology
and environment are very simple and deterministic, and
therefore do not produce any noise or other uncertainties in the
sensor data stream. The uncertainty about the next possible angular
position of the pole is small if the current pole position is
known. In other words, the cart-pole system is regular by definition
and no further regularities can be found by maximizing
the PI. We speculate that the conditional entropy, which cannot
be reduced by the learning in this setting, dampens the exploration
effect of the entropy term in the PI maximization. For the
hexapod rescue experiment, the situation is different. There is an
uncertainty about the next sensor state, given the current sensor
state, which results from the morphology and the construction of
the arena. The PI maximization is able to find regularities which
can then be exploited to maximize the ERF in the RL setting.
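The decomposition of the one-step PI into an entropy term and a conditional entropy term can be made concrete with a small numerical sketch. It uses plug-in (frequency-count) estimates on discrete toy sequences. The estimator, the toy data, and all names are assumptions chosen for illustration and are not the estimator used in the experiments.

```python
import math
import random
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(samples)
    total = len(samples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def one_step_pi(states):
    """Return (PI, H(S_{t+1}), H(S_{t+1} | S_t)) for a discrete state sequence."""
    current, following = states[:-1], states[1:]
    h_next = entropy(following)
    # Chain rule: H(S_{t+1} | S_t) = H(S_t, S_{t+1}) - H(S_t)
    h_cond = entropy(list(zip(current, following))) - entropy(current)
    return h_next - h_cond, h_next, h_cond


# Deterministic cycle: the next state is fully determined by the current one.
cycle = [0, 1, 2, 3] * 50 + [0]
pi_cycle, h_cycle, h_cond_cycle = one_step_pi(cycle)

# Independent uniform noise: the current state says little about the next one.
rng = random.Random(0)
noise = [rng.randrange(4) for _ in range(2000)]
pi_noise, h_noise, h_cond_noise = one_step_pi(noise)
```

For the deterministic cycle the conditional entropy vanishes, so the PI coincides with the entropy of the next state. For the noise sequence the entropy is similarly high, but the conditional entropy is almost as large, so the PI is close to zero.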

The results contradict our intuition, as the one-step predictive
information has shown good results when applied as
an information-driven self-organization principle in the context
of embodied artificial intelligence (Zahedi et al., 2010; Martius
et al., 2013). The intuitively plausible next step was to guide
the information-driven self-organization toward solving a goal
by combining it with an external reward signal in a reinforcement
learning context. The approach evaluated in this paper was
to linearly combine the PI with an external reward signal in
episodic policy gradient learning. If anything, the PI showed
positive short-term results when the world was considerably
probabilistic and the external reward was sparse. Compared to no
intrinsic reward, the PI showed negative results for its asymptotic
behavior. The performance of the PI was either equal or
worse compared to the entropy in all cases. This leads to the
conclusion that research in the context of information-driven
intrinsic rewards and reinforcement learning should be carried
out in other directions, which are briefly described in the final
paragraph.

We have used a constant combination factor $\gamma$ for all experiments
presented in this work. It is known from general learning
theory that a decaying learning rate is required for the convergence
of a system. We chose not to use a decaying factor $\gamma$,
because this means that the internal drive is slowly dampened
until its effect is negligible (at least in a technical application).
This would contradict the idea of motivation-driven and
open-ended learning of embodied agents. However, the results of
our present paper reveal a disadvantage of this approach in the
asymptotic limit, and therefore suggest, contrary to our original
thoughts, pursuing a strategy with a decaying combination
factor. The second possible modification of this approach is to
replace the linear combination of the internal and external
reward by a non-linear function, of which multiplicative and
exponential functions are two examples. Third, using a gradient
of the PI instead of a random exploration in the context
of RL is a promising approach that is currently being investigated. In
this approach, we will use a gradient on an estimate of the PI
and not the error of a predictor as in, e.g., Schmidhuber (1991).
Fourth, we will continue to evaluate other information-theoretic
measures in the context of task-dependent learning with the
support of information-driven intrinsic motivation. In addition
to using correlation measures, such as the mutual information,
we believe that causal measures in the sensorimotor loop (Ay
and Zahedi, 2013), such as the measure considered in (Zahedi
and Ay, 2013), are good candidates for future research in this
field.
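The first suggested modification, a decaying combination factor, could look like the following sketch. It is purely hypothetical: the exponential schedule, the half-life parameter, and all names are illustrative choices made here and were not evaluated in this work. The argument weight stands in for the scaling of the intrinsic reward, whose exact definition (Equation 9) is not reproduced.

```python
def decaying_gamma(gamma_0, batch, half_life):
    """Hypothetical schedule: halve the combination factor every half_life batches."""
    return gamma_0 * 0.5 ** (batch / half_life)


def combined_reward(r_ex, r_in, weight):
    """Linear combination of external and intrinsic reward with a given weight."""
    return r_ex + weight * r_in


# Starting at 1%, the factor drops to 0.5% after one half-life of 100 batches.
gamma_start = decaying_gamma(0.01, batch=0, half_life=100)
gamma_later = decaying_gamma(0.01, batch=100, half_life=100)
```

With such a schedule the intrinsic term dominates exploration in the early batches, while its influence on the asymptotic policy fades, at the cost of giving up a permanently active internal drive.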